Abstract

We are studying long term sequence prediction (forecasting). We approach
this by investigating criteria for choosing a compact useful state representation.
The state is supposed to summarize useful information from the history.
We want a method that is asymptotically consistent in the sense that it will
provably eventually only choose between alternatives that satisfy an optimality
property related to the used criterion. We extend our work to the case where
there is side information that one can take advantage of and, furthermore,
we briefly discuss the active setting where an agent takes actions to achieve
desirable outcomes.

1 Introduction

When studying long term sequence prediction one is interested in answering ques- 
tions like: What will the next k observations be? How often will a certain event 
or a sequence of events occur? What is the average rate of a variable like cost or 
income? This can be interesting for forecasting time series and for choosing policies 
with desirable outcomes. 

Hidden Markov Models [CMR05, EM02] are often used for long term forecasting
and sequence prediction. In this article we will restrict our study to models based
on states that result from a deterministic function of the history, in other words,
states that summarize useful information that has been observed so far. We will
consider finite state space maps with the property that given the current state and
the next observation we can determine the next state. These maps are sometimes
called Probabilistic-Deterministic Finite Automata (PDFA) [VTdlH+05a] and they
have recently been applied in reinforcement learning [Mah10]. A particular example
of this is to use suffix trees [Ris83, Sin96, McC96].

Our goal is to prove consistency for our penalized Maximum Likelihood criteria 
for picking a map from histories to states in the sense that we want to eventually only 
choose between alternatives that are optimal for prediction. The sense of optimality 
could relate to predicting the next symbol, the next k symbols or to have minimal 
entropy rate for an infinite horizon. 

After the preliminary Section 2 we begin our theory development in Section 3.
In our problem setting we have a finite set $\mathcal{Y}$, a sequence $y_n$ of elements
from $\mathcal{Y}$, and we are interested in predicting the future of the sequence $y_n$.
To do this, being inspired by [Hut09] where general criteria for choosing a feature map
for reinforcement learning were discussed, we first want to learn a feature map
$\Phi(y_{1:n}) = s_n$ where $y_{1:t} := y_1, \ldots, y_t$.

We would like the map to have the following properties: 

1. The distribution for the sequence $s_n$ induced by the distribution for the
sequence $y_n$ should be that of a Markov chain or should be a distribution which
is indistinguishable from a Markov chain for the purpose of predicting the
sequence $y_n$.

2. We want as few states as possible so that we can learn a model from a modest 
amount of data. 

3. We want the model of the sequence $y_n$ that arises as a function of the Markov
chain $s_n$ to be as good as possible. Ideally it should be the true distribution.

Our approach consists of defining criteria that can be applied to any class of $\Phi$,
but later we restrict our study to a class of maps that are defined by finite-state
machines. These maps are defined by introducing a deterministic function $\psi$ such
that $s_n = \psi(s_{n-1}, y_n)$. If we have chosen such a map $\psi$ and a first state
$s_0$ then the sequence $y_n$ determines a unique sequence $s_n$ and therefore we have
also defined a map $\Phi(y_{1:n}) = s_n$.

In Section 2 we provide some preliminaries on random sequences and Hidden
Markov Models. We introduce a class of ergodic sequences which is the class of
sequences that we work with in this article. They are sequences with the property that
an individual sequence determines a distribution over infinite sequences. We present
our consistency theory by first presenting very generic results in the beginning of
Section 3 and then we show how various classes of maps and models fit into this.
This has the consequence that we first have results where we guarantee optimality
given that the individual sequence that we work with has certain properties (and
these results, therefore, have no "almost sure" in the statement since the setting is
not probabilistic) while in the latter part we show that if we sample the sequence in
certain ways we will almost surely get a sequence with these properties. In particular,
in Section 4 we will take a closer look at suffix tree sources and maps based on
finite state machines related to probabilistic deterministic finite automata. Section 5
summarizes the findings in a main theorem that says that under some assumptions (a
class of maps based on finite state machines of bounded memory and ergodicity) we
will recover the true model (or the closest we can get to the true model). Section 6
contains a discussion of sequence prediction with side information. Section 7 briefly
discusses the active case where an agent acts in an environment and earns rewards,
and finally Section 8 contains our conclusions.



2 Preliminaries 

In this section we will review some notions and results that the rest of the article will 
rely upon. We start with random sequences and then follows a section on Hidden 
Markov Models (HMM). 

Random Sequences. Consider the set of all infinite sequences $y_t$, $t = 1, 2, \ldots$ of
elements from a finite alphabet $\mathcal{Y}$. We equip the set with the $\sigma$-algebra
that is generated by the cylinder sets
$\Gamma_{y_{1:n}} = \{x_{1:\infty} \mid x_t = y_t, t = 1, \ldots, n\}$. A measure with
respect to this space is determined by its values on the cylinder sets. Not every set
of values is valid. We need to assume that the measure of $\Gamma_{y_{1:n}}$ is the sum
of the measures of the sets $\Gamma_{y_{1:n} y}$ for all possible $y \in \mathcal{Y}$. If
we want it to be a probability measure we furthermore need to assume that the measure
of the whole space $\mathcal{Y}^\infty$ (which is the cylinder set $\Gamma_\epsilon$ of
the empty string $\epsilon$) equals one. The concept that is introduced in the following
two definitions is of central importance to this article. In particular, ergodic sequences
are the class of sequences that we intend to model. They are sequences that can be used
to define a distribution over infinite sequences that we will be interested in learning.

Definition 1 (Distribution defined from one sequence) A sequence $y_{1:\infty}$
defines a probability distribution on infinite sequences if the (relative) frequency of
every finite substring of $y_{1:\infty}$ converges asymptotically. The probabilities of
the cylinder sets are defined to equal those limits:

$$\Pr(\Gamma_{z_{1:m}}) := \lim_{n\to\infty} \#\{t < n : y_{t+1:t+m} = z_{1:m}\}/n$$

Definition 2 (ergodic sequence) We say that a sequence is ergodic if the
frequencies of every finite substring are converging asymptotically.
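
As a minimal sketch of the quantity in Definitions 1 and 2, the following Python function
computes the relative frequency of a substring in a finite prefix. The function name
`substring_frequency` and the 0-based indexing are illustrative assumptions, not part of
the original text.

```python
def substring_frequency(y, z, n=None):
    """Relative frequency #{t < n : y[t+1 : t+m] = z} / n (0-based slices)."""
    m = len(z)
    # By default use every starting position at which a full window fits.
    n = len(y) - m + 1 if n is None else n
    hits = sum(1 for t in range(n) if tuple(y[t:t + m]) == tuple(z))
    return hits / n


# A periodic sequence: the frequency of "ab" approaches 1/2 as the prefix grows.
y = "ab" * 1000
print(substring_frequency(y, "ab"))
```

For an ergodic sequence in the sense of Definition 2, this number converges for every
choice of `z` as the prefix length grows.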

As probabilistic models for random sequences we will in this article focus on
Hidden Markov Models (HMMs) [BP66, Pet69]. More recent surveys on Hidden
Markov Models are [EM02, CMR05].

Hidden Markov Models. Here we define distributions over sequences of elements
from a finite set $\mathcal{Y}$ of size $Y$ based on an unobserved Markov chain of
elements from a finite state set $\mathcal{S}$ of size $S$.

Definition 3 (Hidden Markov Model, HMM) Assume that we have a Markov
chain with an $S \times S$ transition matrix $T = (T_{s,s'})$ and that we also have an
$S \times Y$ emission matrix $E = (E_{s,y})$ where $E_{s,y}$ is the probability that state
$s$ will generate outcome $y \in \mathcal{Y}$. If we introduce a starting probability
vector we have defined a probability distribution over sequences of elements from
$\mathcal{Y}$. This is called a Hidden Markov Model (HMM).
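
To make Definition 3 concrete, here is a small sketch of how one might draw a sequence
from an HMM given $T$, $E$ and a starting vector. The function name `sample_hmm` and the
convention "emit from the current state, then move" are assumptions made for this
illustration.

```python
import random


def sample_hmm(T, E, start, n, rng):
    """Draw y_1..y_n from an HMM.

    T[s][s2] is the transition probability, E[s][y] the emission
    probability and start[s] the initial state probability.
    """
    states = range(len(T))
    s = rng.choices(states, weights=start)[0]
    ys = []
    for _ in range(n):
        # Emit an outcome from the current state, then move to the next state.
        ys.append(rng.choices(range(len(E[s])), weights=E[s])[0])
        s = rng.choices(states, weights=T[s])[0]
    return ys


# Two states that alternate deterministically and each emit one fixed symbol.
T = [[0.0, 1.0], [1.0, 0.0]]
E = [[1.0, 0.0], [0.0, 1.0]]
print(sample_hmm(T, E, [1.0, 0.0], 6, random.Random(0)))
```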

Sequence Prediction. One use of Hidden Markov Models (and functions of
Markov chains) is sequence prediction. Given a history $y_1, \ldots, y_n$ we want to
predict the future $y_{n+1}, \ldots$ In some situations we know what state we are in at
time $n$ and that state then summarizes the entire history without losing any useful
information since the future is conditionally independent of the past, given the
current state. If we are doing one step prediction we are interested in knowing
$\Pr(y_{n+1} \mid s_n)$. We can also consider a zero step lookahead (called filtering)
$\Pr(y_n \mid s_n)$ or an $m$ step $\Pr(y_{n+1}, \ldots, y_{n+m} \mid s_n)$. The $m$ step
could also be just $\Pr(y_{n+m} \mid s_n)$. In a sense we can consider an infinite
lookahead ability evaluated by the entropy rate
$-\lim_{m\to\infty} \frac{1}{m} \log \Pr(y_{n+1}, \ldots, y_{n+m} \mid s_n)$. If the Markov
chain is ergodic this limit does not depend on the state $s_n$.

Limit Theorems. The following theory that is the foundation for studying
consistency of HMMs was developed in [BP66] and [Pet69]. See [CMR05] chapter 12 for
the modern state of the art.

Definition 4 (ergodic Markov chain) A Markov chain (and the stochastic matrix
that contains its transition probabilities) is called ergodic if it is possible to move
from state $s$ to state $s'$ in a finite number of steps for all $s$ and $s'$.

The following theorem [CMR05] introduces the generalized cross-entropy $H$ and
shows that it is well defined and that it can be estimated for ergodic HMMs. It
can be interpreted as the (idealized) expected number of bits needed for coding a
symbol generated by a distribution defined by $\theta_0$ but using the distribution
defined by $\theta$.

Theorem 5 (ergodic HMMs) If $\theta$ and $\theta_0$ are HMM parameters where the
transition matrix for $\theta_0$ is an ergodic stochastic matrix, then there exists a
finite number $H(\theta_0, \theta)$ (which can also be defined as
$\lim_{n\to\infty} H_{n,s}(\theta_0, \theta)$ for any initial state $s$, where
$H_{n,s}(\theta_0, \theta) := -\frac{1}{n} E_{\theta_0} \log \Pr(y_1, \ldots, y_n \mid s_0 = s, \theta)$)
such that $P_{\theta_0}$-a.s.

$$-\lim_{n\to\infty} \frac{1}{n} \log \Pr(y_1, \ldots, y_n \mid \theta) = H(\theta_0, \theta)$$

and the convergence is uniform in the parameter space.
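
The finite-sample quantity $-\frac{1}{n} \log \Pr(y_1, \ldots, y_n \mid \theta)$ that appears
in the theorem can be computed with the standard scaled forward recursion. The sketch
below is only an illustration: the function name is hypothetical, natural logarithms are
used (nats rather than bits), and the emission convention "state at time $t$ emits $y_t$"
is an assumption.

```python
import math


def neg_log_likelihood_rate(ys, T, E, start):
    """Return -(1/n) log Pr(y_1..y_n | theta) using the scaled forward recursion.

    Raises ValueError (from math.log) if the sequence has probability zero.
    """
    S = len(T)
    alpha = [start[s] * E[s][ys[0]] for s in range(S)]
    log_prob = 0.0
    for t in range(len(ys)):
        if t > 0:
            # Propagate the filtered state distribution and weight by the emission.
            alpha = [sum(alpha[r] * T[r][s] for r in range(S)) * E[s][ys[t]]
                     for s in range(S)]
        scale = sum(alpha)              # equals Pr(y_t | y_1..y_{t-1})
        log_prob += math.log(scale)
        alpha = [a / scale for a in alpha]
    return -log_prob / len(ys)


# A one-state HMM with uniform binary emissions has rate log 2 for any sequence.
print(neg_log_likelihood_rate([0, 1, 1, 0], [[1.0]], [[0.5, 0.5]], [1.0]))
```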

Definition 6 (Equivalent HMMs) For an HMM $\theta_0$, let $M[\theta_0]$ be the set of
all $\theta$ such that the HMM with parameters $\theta$ defines the same distribution
over outcomes as the HMM with parameters $\theta_0$.

Theorem 7 (Minimal cross-entropy for the truth and only the truth)
$H(\theta_0, \theta) \ge H(\theta_0, \theta_0)$ with equality if and only if
$\theta \in M[\theta_0]$.

3 Maps From Histories To States 

Given a sequence of elements $y_n$ from a finite alphabet we want to define a map
$\Phi : \mathcal{Y}^* \to \mathcal{S}$, which maps histories (finite strings) of elements
to states $\Phi(y_{1:n}) = s_n$. The reasons for this include, as was explained in the
introduction, in particular the ability to learn a model efficiently. Suppose that every
$\Phi$ under consideration is such that the size of its state space $\mathcal{S}$ is a
finite number that depends on $\Phi$.

We are also interested in the case when we have side information $x_n \in \mathcal{X}$
and we define a map $\Phi : (\mathcal{X} \times \mathcal{Y})^* \to \mathcal{S}$. In this
more general case the models that we consider for the sequence $y$ will have hidden
states while in the case without side information the state (given the $y$ sequence) is
not hidden. We have two reasons for expressing everything in an HMM framework. We can
model long-range dependence in the $y_n$ sequence through having states and we include
the more general case where there is side information.

Definition 8 (Feature sequence/process) A map $\Phi$ from finite strings of
elements from $\mathcal{Y}$ (or $\mathcal{X} \times \mathcal{Y}$) to elements in a finite
set $\mathcal{S}$ and a sequence $y_{1:n}$ induces a state sequence $s_{1:n}$. Define an
HMM through maximum likelihood estimation: The sequence $s_t = \Phi(y_{1:t})$ gives
transition matrix $T(n) = (T_{s,s'}(n))$ of probabilities

$$T_{s,s'}(n) := \frac{\#\{t < n \mid s_t = s,\ s_{t+1} = s'\}}{\#\{t < n \mid s_t = s\}}$$

and emission matrix $E(n)$ of probabilities

$$E_{s,y}(n) := \frac{\#\{t < n \mid s_t = s,\ y_t = y\}}{\#\{t < n \mid s_t = s\}}.$$

Denote those HMMs by $\theta_n := (T(n), E(n))$. We will refer to the sequence $\theta_n$
as the parameters corresponding to $\Phi$ or generated by $\Phi$.
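
The counting estimates of Definition 8 can be sketched in a few lines of Python. The
function name `estimate_parameters`, the dictionary representation of the matrices and
the example map (state = last symbol) are assumptions chosen for illustration. Calling
`phi` on every prefix is quadratic in the length of the sequence, which is acceptable
for a sketch.

```python
from collections import Counter


def estimate_parameters(ys, phi):
    """Frequency estimates T_{s,s'}(n) and E_{s,y}(n) for states s_t = phi(y_{1:t})."""
    states = [phi(ys[:t + 1]) for t in range(len(ys))]
    visits, trans, emit = Counter(), Counter(), Counter()
    for t in range(len(ys) - 1):        # t < n, so that s_{t+1} exists
        visits[states[t]] += 1
        trans[states[t], states[t + 1]] += 1
        emit[states[t], ys[t]] += 1
    T = {k: c / visits[k[0]] for k, c in trans.items()}
    E = {k: c / visits[k[0]] for k, c in emit.items()}
    return T, E


# Hypothetical map: the state is the most recent symbol.
T, E = estimate_parameters("aabab", lambda h: h[-1])
print(T)
print(E)
```

With this particular map the emissions are deterministic, since the state already
contains the current symbol.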



We will first state results based on some generic properties that we have defined 
with just the goal of making the proofs work. Then we will show that some more 
easily understandable cases will satisfy these properties. We structure it this way 
not only for generality but also to make the proof techniques clearer. 

Ergodic Sequences. We begin by defining the fundamental ergodicity properties 
that we will rely upon. We provide asymptotic results for individual sequences that 
satisfy these properties. In the next two subsections we identify situations where we 
will almost surely get such a sequence which satisfies these ergodicity properties. 

Definition 9 (ergodic w.r.t. $\Phi$) As stated in Definition 2 we say that a sequence
$y_t$ is ergodic if all substring frequencies converge as $n \to \infty$. Furthermore we
say that

1. the sequence $y_t$ is ergodic with respect to a map $\Phi(y_{1:t}) = s_t$ if all state
transition frequencies $T_{s,s'}(n)$ and emission frequencies $E_{s,y}(n)$ converge as
$n \to \infty$.

2. the sequence $y_t$ is ergodic with respect to a class of maps if it is ergodic with
respect to every map in the class.

Definition 10 (HMM-ergodic) We say that a sequence $y_t$ is HMM-ergodic for a
set of HMMs $\Theta$ if there is an HMM with parameters $\theta_0$ such that

$$-\frac{1}{n} \log \Pr(y_1, \ldots, y_n \mid \theta) \to H(\theta_0, \theta)$$

uniformly on compact subsets of $\Theta$.

Definition 11 (Log-likelihood) $L_n(\Phi) = -\log \Pr(y_1, \ldots, y_n \mid \theta_n)$

We will prove our consistency results by first proving consistency using Maximum 
Likelihood (ML) for a finite class of maps and then we prove that we can add a 
sublinearly growing model complexity penalty and still have consistency. 

Proposition 12 (HMM consistency of ML for finite class) Suppose that $y_t$
is HMM-ergodic for the parameter set $\Theta$ with optimal parameters (in the sense of
Definition 10) $\theta_0$, that $y_t$ is ergodic for the finite class of maps
$\{\Phi_i\}_i$, and suppose that $\theta_i \in \Theta$ are the limiting parameters
generated by $\Phi_i$. Then it follows that there is $N < \infty$ such that for all
$n > N$ the map $\Phi_i$ selected by minimizing $L_n$ generates parameters $\theta_i^n$
whose limit is in $\arg\min_{\theta_j} H(\theta_0, \theta_j)$.

Proof. It follows from Definition 10 and continuity (in $\theta$) of the log-likelihood
that

$$\lim_{n\to\infty} \frac{1}{n} L_n(\Phi_i) = H(\theta_0, \theta_i)$$

since the convergence in Definition 10 is uniform. Note that the parameters that
define the log-likelihood $L_n(\Phi_i)$ can be different for every $n$ so the uniformity
of the convergence is needed to draw the conclusion above. By Definition 9 we know that
if $\theta_i^n$ are the parameters generated by $\Phi_i$ at time $n$, then
$\lim_{n\to\infty} \theta_i^n = \theta_i$ exists for all $i$. It follows that if
$\theta_i \notin \arg\min_{\theta_j} H(\theta_0, \theta_j)$ then there must be an
$N < \infty$ such that $\Phi_i$ is not selected at times $n > N$. Since there are only
finitely many maps in the class there will be a finite such $N$ that works for all
relevant $i$. ■

Definition 13 (HMM Cost function) If the HMM with parameters $\theta_n$ that has
been estimated from $\Phi$ at time $n$ has $S$ states, then let

$$\mathrm{Cost}_n(\Phi) = -\log \Pr(y_1, \ldots, y_n \mid \theta_n) + \mathrm{pen}(n, S)$$

where $\mathrm{pen}(n, S)$ is a positive function that is increasing in both $n$ and $S$
and is such that $\mathrm{pen}(n, S)/n \to 0$ for $n \to \infty$ for all $S$.

We call the negative log-probability term the data coding cost and the other term
is the model complexity penalty. They are both motivated by coding (coding the data
and the model). For instance in MDL/MML/BIC,
$\mathrm{pen}(n, S) = \frac{d}{2} \log n + O(1)$, where $d$ is the dimensionality of the
model $\theta$.
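
A minimal sketch of such a cost function with a BIC-style penalty is given below. The
parameter count $d = S(S-1) + S(Y-1)$ (free transition plus free emission probabilities)
and both function names are assumptions for illustration. The text itself only requires
the penalty to be positive, increasing in $n$ and $S$, and sublinear in $n$.

```python
import math


def bic_penalty(n, S, Y):
    """(d/2) log n, with d = S(S-1) + S(Y-1) free parameters (an assumed count)."""
    d = S * (S - 1) + S * (Y - 1)
    return 0.5 * d * math.log(n)


def cost(neg_log_prob, n, S, Y):
    """Cost_n = data coding cost + model complexity penalty."""
    return neg_log_prob + bic_penalty(n, S, Y)


# The penalty per observation vanishes as n grows, as Definition 13 requires.
for n in (10**2, 10**4, 10**6):
    print(n, bic_penalty(n, S=3, Y=2) / n)
```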

Proposition 14 Suppose that $\Phi_0$ has optimal limiting parameters $\theta_0$ with
as few states as possible. In other words if an HMM has fewer states than the HMM
defined by $\theta_0$, then it has a strictly larger entropy rate. We use a (finite,
countable, or uncountable) class of maps that includes only $\Phi_0$ and maps that have
strictly fewer states. We assume that all the maps generate converging parameters. Then
there is an $N$ such that the function Cost is minimized by $\Phi_0$ at all times
$n > N$.

Proof. Suppose that $\theta_0$ has $S_0$ states. We will use a bound for how close one
can get to the true HMM using fewer states. We would like to have a constant
$\varepsilon > 0$ such that $H(\theta_0, \theta) > H(\theta_0, \theta_0) + \varepsilon$
for all $\theta$ with fewer than $S_0$ states. The existence of such an $\varepsilon$
follows from continuity of $H$ (which is actually also differentiable [BP66]), the
fact that the HMMs with fewer than $S_0$ states can be compactly (in the parameter
space) embedded into the space of HMMs with exactly $S_0$ states, and that this
embedded subspace has a strictly positive minimum Euclidean distance from $\theta_0$ in
this parameter space.

The existence of $\varepsilon > 0$ with this property implies the existence of $D > 0$
such that the alternatives with fewer than $S_0$ states have, for large $n$, at least
$Dn$ worse log probabilities than the distribution $\theta_0$. Therefore the penalty term
(for which $\mathrm{pen}(n, S)/n \to 0$) will not be able to indefinitely compensate for
the inferior modeling. ■

Theorem 15 (HMM consistency of Cost for finite class) Proposition 12 is
also true for Cost.

Proof. $H(\theta_0, \theta_k) < H(\theta_0, \theta_j)$ implies that there is a constant
$C > 0$ such that for large $n$, $L_n(\Phi_j) - L_n(\Phi_k) > Cn$. Since
$\mathrm{pen}(n, S)/n \to 0$ for $n \to \infty$ we know that any difference in model
penalty will be overtaken by the linearly growing difference in data code length. ■
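
The comparison in this proof can be written out in one line. The symbols $S_j$ and $S_k$
for the numbers of states of $\Phi_j$ and $\Phi_k$ are introduced here only for
illustration; the inequality uses that the penalty is positive, as required in
Definition 13.

```latex
\mathrm{Cost}_n(\Phi_j) - \mathrm{Cost}_n(\Phi_k)
  = L_n(\Phi_j) - L_n(\Phi_k) + \mathrm{pen}(n, S_j) - \mathrm{pen}(n, S_k)
  > Cn - \mathrm{pen}(n, S_k)
  = n \left( C - \frac{\mathrm{pen}(n, S_k)}{n} \right)
  \to \infty \quad (n \to \infty).
```

Hence, for large $n$, the suboptimal map $\Phi_j$ has strictly larger Cost than $\Phi_k$
and is not selected.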

Maps that induce HMMs. In this section we will assume that we use a class of 
maps whose states we know form a Markov chain. 

Definition 16 (Feature Markov Process, $\Phi$MP) Suppose that

$$\Pr(y_n \mid \Phi_0(y_1), \ldots, \Phi_0(y_{1:n})) = \Pr(y_n \mid \Phi_0(y_{1:n}))$$

and that the state sequence is Markov, i.e.

$$\Pr(\Phi_0(y_{1:n}) \mid \Phi_0(y_1), \ldots, \Phi_0(y_{1:n-1})) = \Pr(\Phi_0(y_{1:n}) \mid \Phi_0(y_{1:n-1})).$$

Then we say that $\Phi_0$ induces an HMM. We call HMMs induced by $\Phi_0$ Feature
Markov Processes ($\Phi$MP). If the HMM that is defined this way by $\Phi_0$ is the true
distribution for the sequence $y_1, y_2, \ldots$, then we say that "$\Phi_0$ is correct".

We will only discuss the situation when the true HMM is ergodic so we will only
say that there is a correct $\Phi_0$ in those situations, hence the statement "$\Phi_0$ is
correct" will contain the assumption that the truth is ergodic.

Example 17 The map $\Phi$ which sends everything to the same state always induces
an HMM but, unless the sequence $y_1, y_2, \ldots$ is i.i.d., it is not correct. ◊

Proposition 18 (Convergence of estimated distributions) If $\Phi_0$ is correct
then $P_{\theta_n} \to P_{\theta_0}$ for $n \to \infty$ (as distributions on finite strings
of any fixed length), where $P_{\theta_0}$ is the true HMM distribution for the outcomes,
$P_\theta$ is the HMM distribution defined by $\theta$ and $\theta_n$ are the parameters
generated by $\Phi_0$.

Proof. We are estimating the parameters $\theta_n$ through maximum likelihood for the
generated sequence of states. Consistency of maximum likelihood estimation for
Markov chains implies that $\theta_n \to \theta_0$. This implies the proposition due to
continuity with respect to the parameters of the likelihood (for any finite sequence
length). ■

Proposition 19 (Inducing HMM implies drawing ergodic sequences) If
we have a set of maps that induce HMMs and the sequence $y_t$ is drawn from one of
the induced ergodic HMMs, then almost surely

1. $y_t$ is HMM-ergodic

2. we will draw an ergodic sequence $y_t$ with respect to the considered class of maps.

Proof. 1. is a consequence of Theorem 5.

2. This follows from consistency of maximum likelihood for Markov chains (generalized
law of large numbers) since the claim is that state transition frequencies and
emission frequencies converge. ■

4 Maps based on Finite State Machines (FSMs)

We will in this section consider maps of a special form that are related to PDFAs.
We will assume that $\Phi$ is such that there is a $\psi$ such that

$$\Phi(y_{1:n}) = \psi(\Phi(y_{1:n-1}), y_n).$$

In other words, the current state is derived deterministically from the previous
state and the current perception. Given an initial state the state sequence is then
deterministically determined by the perceptions and therefore the combination of $\psi$
with an initial state defines a map $\Phi$ from histories to states. This class of maps
$\Phi$ can also define a class of probabilistic models of the sequence $y_n$ by assuming
that $y_n$ only depends on $s_{n-1} = \Phi(y_{1:n-1})$. This leads to the formula

$$\Pr(s' \mid s) = \sum_{y : \psi(s, y) = s'} \Pr(y \mid s)$$

and as a result we have defined an HMM for the sequence $y_n$.

Definition 20 (Sampling from FSM) If we follow the procedure above we say
that we have sampled the sequence $y_t$ from the FSM. If the Markov chain of states
is ergodic we say that we have sampled $y_t$ ergodically from the FSM.
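
The formula for $\Pr(s' \mid s)$ can be sketched directly in code: for each state, the
emission probabilities of all symbols that lead to the same next state are added up. The
function name `induced_transitions`, the dictionary layout and the two-state example are
hypothetical and serve only as an illustration.

```python
def induced_transitions(psi, emit):
    """Pr(s'|s) = sum of Pr(y|s) over the symbols y with psi(s, y) = s'."""
    trans = {}
    for s, dist in emit.items():
        for y, p in dist.items():
            key = (s, psi(s, y))
            trans[key] = trans.get(key, 0.0) + p
    return trans


# Hypothetical FSM: symbols "b" and "c" lead to state "hi", symbol "a" to state "lo".
def psi(s, y):
    return "hi" if y in ("b", "c") else "lo"


emit = {"lo": {"a": 0.5, "b": 0.3, "c": 0.2}, "hi": {"a": 1.0}}
print(induced_transitions(psi, emit))
```

In this example both states can reach each other, so the induced Markov chain is ergodic
in the sense of Definition 4.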



Suffix Trees. We consider a class of maps based on FSMs that can be expressed
using Suffix Trees [Ris86] with the same states (suffixes) as the FSM. The resulting
models are sometimes called FSMX sources. A suffix tree is defined by a suffix set
which is a set of finite strings. The set must have the property that none of the
strings is an ending substring (a suffix) of another string in the set and such that any
sufficiently long string ends with a substring in the suffix set. Given any sufficiently
long string we then know that it ends with exactly one of the suffixes from the suffix
set. If the suffix set furthermore has the property that given the previous suffix
and the new symbol there is exactly one element (state) from the suffix set that
can be (and is) the end of the new longer string, then it is an FSMX source. Another
terminology says that the suffix set is FSM closed. The property implies (directly
by definition) that there is a map $\psi$ such that $\psi(s_{t-1}, y_t) = s_t$.
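
The FSM closure property described above can be checked mechanically for a small suffix
set. The sketch below uses hypothetical helper names (`state_of`, `is_fsm_closed`) and
represents histories and suffixes as Python strings. It assumes that the input is already
a valid suffix set and only tests the closure condition.

```python
def state_of(history, suffixes):
    """Return the unique suffix in the set that the history ends with, or None."""
    matches = [s for s in suffixes if history.endswith(s)]
    return matches[0] if len(matches) == 1 else None


def is_fsm_closed(suffixes, alphabet):
    """Check that every (previous suffix, new symbol) pair determines the next suffix."""
    return all(state_of(s + y, suffixes) is not None
               for s in suffixes for y in alphabet)


# Closed: the previous suffix plus the new symbol always identifies the next suffix.
print(is_fsm_closed(["0", "01", "11"], "01"))
# Not closed: from state "1", reading "0" gives "10", which cannot be resolved
# to "010" or "110" without knowing the symbol before the "1".
print(is_fsm_closed(["1", "00", "010", "110"], "01"))
```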

The following proposition shows a very nice connection between ergodic
sequences and FSMX sources which will be generalized in Proposition 25 to more
general sources based on bounded-memory FSMs.

Proposition 21 (ergodicity of suffix trees) If we have a set of maps based on
FSMs that can be expressed by suffix trees, and the sequence $y_t$ is sampled ergodically
(Definition 20) using one of the maps, then almost surely we get a sequence $y_t$ that
is ergodic with respect to the considered class of maps and $y_t$ is HMM-ergodic.

Lemma 22 If the sequence $y_t$ is ergodic, then the state transition frequencies and
emission (of $y$) frequencies for an FSM closed suffix tree are converging.

Proof. Let the map $\Phi$ be defined by the suffix set in question. Suppose that $s'$
is a suffix that can follow directly after $s$. This means that there is a symbol $y$
such that if you concatenate it to the end of the string $s$, then this new string
$\tilde{s}$ ends with the string $s'$. This means that whenever a string of symbols
$y_{1:n}$ ends with $\tilde{s}$, then the sequence of states generated by applying the
map $\Phi$ to the sequence $y_{1:n}$ will end with $s_{n-1} = s$ and $s_n = s'$. It is
also true that whenever the state sequence ends with $ss'$ then $y_{1:n}$ ends with
$\tilde{s}$. Therefore, the counts (of $ss'$ in the state sequence and $\tilde{s}$ in the
$y$ sequence) up until any finite time point are also equal. We will in this proof say
that $\tilde{s}$ is the string that corresponds to $ss'$.

Given any ordered pair of states $(s, s')$ where $s'$ can follow $s$, let $c_{s,s'}(n)$
be the number of times $ss'$ occurs in the state sequence up to time $n$ and let
$d_{s,s'}(n)$ be the number of times the string $\tilde{s}$ that corresponds to $ss'$ has
occurred. We know that $c_{s,s'}(n) = d_{s,s'}(n)$ for any such pair $ss'$ and any $n$.
If $s'$ cannot follow $s$ we let both $c_{s,s'} = 0$ and $d_{s,s'} = 0$. The state
transition frequency for the transition from $s$ to $s'$ up until time $n$ is

$$\frac{c_{s,s'}(n)}{\sum_{s''} c_{s,s''}(n)} = \frac{d_{s,s'}(n)}{\sum_{s''} d_{s,s''}(n)} = \frac{d_{s,s'}(n)}{d_s(n)} = \frac{d_{s,s'}(n)/n}{d_s(n)/n}$$

where $d_s(n)$ is the number of times that the string that defines $s$ has occurred up
until time $n$ in the $y$ sequence. The right hand side converges to the frequency of the
string $\tilde{s}$ divided by the frequency of the string that defines $s$. Thus we have
proved that state transition frequencies converge. Emissions work the same way. ■

Lemma 23 If we sample $y_t$ ergodically from a suffix tree FSM, then the frequency
for each finite substring will converge almost surely. In other words the sequence $y_t$
is almost surely ergodic.

Proof. If the suffix tree defines an FSM as we have defined it above, the states of the
suffix tree will form an ergodic Markov chain. An ergodic Markov chain has a unique
stationary distribution. For any state and finite string of perceptions there is a
certain fixed probability of drawing the string in question. The frequency of the string
$str$ is $\sum_s \Pr(s) \Pr(str \mid s)$ where $\Pr(s)$ is the stationary probability of
seeing $s$ and $\Pr(str \mid s)$ is the probability of directly seeing exactly $str$
conditioned on being in state $s$. It follows from the law of large numbers that the
frequency of any finite string $str$ converges.

Another way of understanding this result is that it is implied by the convergence
of the frequency of any finite string of states in the state sequence. ■

Proof of Proposition 21. Lemma 22 and Lemma 23 together imply the proposition
since they say that if we sample from a suffix tree then we almost surely get
converging frequencies for all finite substrings and this implies converging transition
frequencies for the states from any suffix tree. ■

Bounded-Memory FSMs. We here notice that the reasons that the suffix tree
theory above worked actually relate to a larger class, namely a class of FSMs where
the internal state is determined by at most a finite number of previous time steps
in the history.

Definition 24 (bounded memory FSM) Suppose that there is a constant $\kappa$ such
that if we know the last $\kappa + 1$ perceptions $y_{t-\kappa}, \ldots, y_t$ then the
present state $s_t$ is uniquely determined. Then we say that the FSM has memory of at
most length $\kappa$ (not counting the current) and that it has bounded memory.

Proposition 25 (ergodicity of FSMs) 1. Consider a sequence yt whose finite 
substring frequencies converge (i.e. the sequence is ergodic) and an FSM of bounded 
memory, then the sequence is ergodic with respect to the map defined by the FSM. 
2. If we sample a sequence yt ergodically from an FSM with bounded memory then 
almost surely yt is HMM-ergodic and its finite substring frequencies converge. 

Proof. The proof works the same way as for suffix tree FSMs. If an FSM has finite
memory of length $\kappa$ then there is a suffix tree of that depth with every suffix of
full length and every state of the FSM is a subset of the states of that suffix tree. The
FSM is a partition of the suffix set into disjoint subsets. Every state transition for
the FSM is exactly one of a set of state transitions for the suffix tree states and the
frequency of every ordered pair of suffix tree states converges almost surely as before.
Therefore, the state transition frequencies for the FSM will almost surely converge.
A distribution that is defined using an FSM of bounded memory can also be
defined using a suffix tree, so 2. reduces to this case. ■

5 The Main Result For Sequence Prediction 

In this section we summarize our results in a main theorem. It follows directly 
from a combination of results in previous sections. They are stated with respect 
to our main class of maps, namely the class that is defined by bounded-memory 
FSMs. The generating models that we consider are models that are defined from a 
map in this class in such a way that the states form an ergodic Markov chain. We 
refer to this as sampling ergodically from the FSM. Our conclusion is that we will 
under these circumstances eventually only choose between maps which generate the 
best possible HMM parameters that can be achieved for the purpose of long-term 
sequence prediction. The model penalty term will influence the choice between these
options towards simpler models. 

The following theorem guarantees that we will almost surely asymptotically find
a correct HMM for the sequence of interest under the assumption that it is possible.

Theorem 26 If we consider a finite class of maps $\Phi_i$, $i = 0, 1, \ldots, k$ based on
finite state machines of bounded memory and if we sample ergodically from a finite state
machine of bounded memory, then there almost surely exist limiting parameters $\theta_i$
for all $i$ and there is $N < \infty$ such that for all $n > N$ the map $\Phi_i$ selected
at time $n$ by minimizing Cost generates parameters whose limit is $\theta_0$, which is
assumed to be the optimal HMM parameters.

Proof. We are going to make use of Proposition 25 together with Theorem 15.
Proposition 25 shows that our assumptions imply the assumptions of Theorem 15
which provides our conclusion. ■
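
To illustrate the selection rule, the sketch below picks among maps that use the last
$k$ symbols as state (full suffix trees of depth $k$, a special case of bounded-memory
FSMs) by minimizing a cost of the form used above. Several details are assumptions made
only for this illustration: the function names, the BIC-style penalty with
$d = Y^k (Y - 1)$ free parameters, and conditioning on the first $k$ symbols instead of
modeling them.

```python
import math
from collections import Counter


def cost_of_order(ys, k, alphabet_size):
    """Cost_n for the map whose state is the last k symbols."""
    ctx_counts, pair_counts = Counter(), Counter()
    for t in range(k, len(ys)):
        ctx = tuple(ys[t - k:t])
        ctx_counts[ctx] += 1
        pair_counts[ctx, ys[t]] += 1
    # Negative log-likelihood under the plug-in maximum likelihood estimates.
    neg_log_prob = -sum(c * math.log(c / ctx_counts[ctx])
                        for (ctx, _), c in pair_counts.items())
    d = alphabet_size ** k * (alphabet_size - 1)    # assumed parameter count
    return neg_log_prob + 0.5 * d * math.log(len(ys))


def select_order(ys, max_order, alphabet_size):
    """Return the order k in 0..max_order with the smallest cost."""
    return min(range(max_order + 1),
               key=lambda k: cost_of_order(ys, k, alphabet_size))


# In the periodic sequence 0,0,1,1,... the next symbol is determined by the last two.
ys = [0, 0, 1, 1] * 100
print(select_order(ys, 3, 2))
```

Orders 0 and 1 pay a data coding cost that grows linearly in the length of the sequence,
while order 3 pays a larger penalty than order 2 without any gain in likelihood.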

Extension to countable classes. To extend our results from finite to countable 
classes of maps we need the model complexity penalty to be sufficiently rapidly 
growing in n and m. This is also necessary if we want to be sure that we eventually 
find a minimal representation of the optimal model that can be achieved by the class 
of maps. 

Proposition 27 (Consistency for countable class) Suppose that we have a
countable class of maps $\Phi_i$, $i = 0, 1, \ldots$ and

1. Suppose that our class is such that for every finite $k$, there are at most finitely
many maps with at most $k$ states.

2. Suppose that $\theta_0$ is an optimal HMM for the sequence $y_t$, that it has $m_0$
states and that $\theta_0$ is the limit of the parameters generated by $\Phi_0$.
Furthermore, suppose that there is a finite $N$ such that whenever $n > N$, $m > m_0$
and $\theta$ is any HMM with $m$ states we have

$$\mathrm{pen}(n, m_0) - \log P_{\theta_0^n}(y_1, \ldots, y_n) < \mathrm{pen}(n, m) - \log P_\theta(y_1, \ldots, y_n),$$

where $\theta_0^n$ are the parameters generated by $\Phi_0$.

Then Theorem 26 is true also for this countable class and we will furthermore
eventually pick a map with at most $m_0$ states.

Proof. The idea of the proof is to reduce the countable case to the finite case that
we have already proven by using that when $n > N$ we will never pick a $\Phi$ with
more than $m_0$ states and then use the first property to say that the remaining class is
finite. This reduction also shows that we will eventually not pick a map with more
states than $m_0$. ■

The first property in the proposition above holds for the class of suffix trees
and for the class based on FSMs with bounded memory. The second property,
but with the HMM maximum likelihood parameters $\theta(n)$ with $m$ states (while
we have ML for a sequence of states and observations), will almost surely hold
if the penalty is such that we have strong consistency for the HMM criterion
$\theta^* = \arg\max \log P_\theta(y_1, \ldots, y_n) - \mathrm{pen}(n, m)$. This is studied
in many articles, e.g. [GB03] where strong consistency is proven for a penalty of the
form $\beta(m) \log n$ where $\beta$ is a cubic polynomial. Note that in the case without
side information (if our map has the properties that $\Phi(y_{1:n})$ determines $y_n$ and
that $\Phi(y_{1:n-1})$ and $y_n$ determine $\Phi(y_{1:n})$) the emissions are
deterministic and the state sequence generated by any map is determined by the $y$
sequence. This puts us in a simpler situation akin to the Markov order estimation problem
[FLN96, CS00] where it is studied which penalties (e.g. BIC) will give us property 2.
above.



Conjecture 28 We almost surely have Property 2. from Proposition 27 for the
BIC penalty studied in [CS00].



6 Sequence Prediction With Side Information 

In this section we will broaden our problem to the setting where we have side
information available to help in our prediction task. In our problem setting we have
two finite sets $\mathcal{X}$ and $\mathcal{Y}$, a sequence $p_n = (x_n, y_n)$ of elements
from $\mathcal{X} \times \mathcal{Y}$, and we are interested in predicting the future of
the sequence $y_n$. To do this we first want to learn a feature map
$\Phi(p_{1:n}) = s_n$. In other words we want our current state to summarize all useful
information from both the $x$ and $y$ sequence for the purpose of predicting the future
of $y$ only.

One obvious approach is to predict the future of the entire sequence $p$, i.e.
predicting both $x$ and $y$ and then in the end only notice what we find out about $y$.
This brings us back to the case we have studied already, since from this point of view
there is no side information. A drawback with that approach can be that we create
an unnecessarily complicated state representation since we are really only interested
in predicting the $y$ sequence.

In the case when there is no side information, $s_t = \Phi(y_{1:t})$. An important
difference in the case with side information is that the sequence $s_{1:t}$ depends on
both $y_{1:t}$ and $x_{1:t}$. Therefore, in the latter case, if we would like to consider a
distribution for $y$ only, $y_1, \dots, y_n$ does not determine the state sequence $s_1, \dots, s_n$:

$$\Pr(y_1, \dots, y_n \mid \theta_n) = \sum \Pr(s_1, \dots, s_n) \Pr(x_1, \dots, x_n, y_1, \dots, y_n \mid s_1, \dots, s_n, \theta_n).$$

This expression is of course also true in the absence of side information $x$, but
then the sum collapses to one term since there is only one sequence of states $s_{1:n}$
that is compatible with $y_{1:n}$.

An alternative to using the Cost criterion on the $p$ sequence is to only model the
$y$ sequence and let

$$L_n(\Phi) = -\log \Pr(y_1, \dots, y_n \mid \theta_n)$$

and then define Cost in exactly the same way as before. This cost function was
called ICost in [Hut09].

Theorem 29 Theorem [?] is true for sequence prediction with side information
using

$$\mathrm{ICost}_n(\Phi) = -\log \Pr(y_1, \dots, y_n \mid \theta_n) + \mathrm{pen}(n, S)$$

if we define "sample ergodically" to refer to the sequence $p_t = (x_t, y_t)$ instead of $y_t$.

Proof. The proofs work exactly as they are written for the case without side 
information. ■ 

Note that a map that is optimal for predicting the y sequence can have fewer 
states than a minimal map that can generate the model of the p sequence. 

It is interesting to note that the interpretation of this result is not as clear as in the
case without side information. It guarantees that, given enough history, the chosen
$\Phi$ can and will (with the asymptotic parameters) define the correct model for the $y_t$
sequence, but the $x_t$ sequence has only played a part in the estimation, and we are
not guaranteed to make use of the extra information if it does not impact
the entropy rate. In particular, this is the case if the information in $x_t$ is only helpful for
a finite number of time steps forward. In this case the gain will not affect the
entropy rate, which is a limit of averages. We have a more conclusive result for the
case with side information when we use the first mentioned approach of applying
Cost to the sequence $p$, since we proved consistency in the previous section in the
sense of finding the true model when possible.

If we have injective maps $\Phi$, e.g. maps defined by non-empty suffix trees, then
we can rewrite Cost in a form that was also used, more generally, in [Hut09]. Therein
a cost called original cost was defined as follows:

Definition 30 (OCost)

$$\mathrm{OCost} = -\log \Pr(s_1, \dots, s_n) - \log \Pr(y_1, \dots, y_n \mid s_1, \dots, s_n, \theta_n) + \mathrm{pen}(n, S).$$

Remark 31 If $\Phi$ is injective and we calculate Cost in the side information case,
then Cost = OCost. ◇
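
To make the three terms of OCost concrete, here is a minimal Python sketch that evaluates them for a given state sequence and $y$ sequence. It rests on assumptions that are not stated in the text: the states are treated as a first-order Markov chain conditioned on the first state, each $y_t$ is taken to depend only on $s_t$, the parameters are empirical (maximum likelihood) frequencies, and the penalty is passed in by the caller. The names `ml_neg_log_prob` and `ocost` are hypothetical.

```python
import math
from collections import Counter


def ml_neg_log_prob(pairs):
    """-log of the product of empirical conditional frequencies
    over a list of (condition, outcome) pairs."""
    pairs = list(pairs)
    joint = Counter(pairs)
    cond = Counter(c for c, _ in pairs)
    return -sum(k * math.log(k / cond[c]) for (c, _), k in joint.items())


def ocost(states, ys, penalty):
    """Sketch of OCost = -log Pr(s_1..s_n) - log Pr(y_1..y_n | s_1..s_n) + pen."""
    # Assumption: first-order Markov states, conditioned on the first state.
    state_term = ml_neg_log_prob(zip(states[:-1], states[1:]))
    # Assumption: y_t depends only on the current state s_t.
    emission_term = ml_neg_log_prob(zip(states, ys))
    return state_term + emission_term + penalty
```

When both the state transitions and the emissions are deterministic in the data, the two likelihood terms vanish and only the penalty remains, which is why the choice of penalty decides between maps that fit equally well.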

If we have no side information, both OCost and ICost will be the same as Cost,
but they may differ when there is side information available. We remarked above
that if we consider only injective $\Phi$ (e.g. non-empty suffix tree based maps) then
OCost equals using Cost on the joint sequence $p_t = (x_t, y_t)$. As noted in [Hut09],
OCost penalizes having many states more than ICost does, and when considering
non-injective $\Phi$ one risks getting a smaller than desired state space.

7 The Active Case 

In this very brief section we will discuss how to map the active case to the previously
introduced notions. The active case will be treated in depth in future articles. In
the active case [RN10, SB98] we have an agent that interacts with an environment.
The agent perceives observations $o_t$ and real-valued rewards $r_t$, and the agent takes
actions $a_t$ from a finite set of possible actions $\mathcal{A}$ with the goal of receiving high total
reward in some sense. We will denote the events that have just occurred when the
agent will take an action at time step $t$, i.e. $a_t$, $o_t$, and $r_t$, by $e_t$. We consider maps
based on FSMs (PDFAs) that take event sequences $e_t$ as input. In the previous
section's notation $x_t = (o_t, a_t)$, $y_t = r_t$ and $p_t = e_t$. We choose this since we are
interested in predicting which future rewards will result from actions chosen with
the help of the observations. This would give us the possibility of determining which
actions will earn the highest rewards.

At time $t - 1$ the past $e_1, \dots, e_{t-1}$ determines $s_{t-1}$, the agent takes an action
$a_{t-1}$, and $o_t$ and $r_t$ are generated according to distributions that only depend on $s_{t-1}$
and $a_{t-1}$. Then we have generated $e_t$ and $s_t = \psi(s_{t-1}, e_t)$.
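
The state update $s_t = \psi(s_{t-1}, e_t)$ can be sketched in Python as follows. The helper names `run_fsm` and `last_k_events`, and the particular bounded-memory choice of $\psi$ that keeps the last $k$ events, are assumptions made for illustration only; any deterministic update function could be passed in instead.

```python
def run_fsm(psi, s0, events):
    """Apply s_t = psi(s_{t-1}, e_t) along the event sequence and
    return the list of visited states s_1, ..., s_n."""
    states = []
    s = s0
    for e in events:
        s = psi(s, e)
        states.append(s)
    return states


def last_k_events(k):
    """Hypothetical bounded-memory update for k >= 1:
    the state is the tuple of the k most recent events."""
    def psi(s, e):
        return (s + (e,))[-k:]
    return psi
```

With `k = 2` and events `["a", "b", "c"]`, starting from the empty tuple, the visited states are `("a",)`, `("a", "b")` and `("b", "c")`, so the current state and the next event always determine the next state.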

Definition 32 The above describes what we mean when we say that the FSM gener- 
ates the environment. We say that the FSM generates the environment ergodically, 
if for any sequence of actions chosen such that the action frequencies for any state 
converge asymptotically, we will have state transitions and emission frequencies that 
converge almost surely to an ergodic HMM. 

Proposition 33 Suppose that we have an FSM of bounded memory generating the
environment ergodically and that the action frequencies for any state converge
asymptotically. Then we will almost surely generate an ergodic sequence of events, and the
reward sequence is HMM-ergodic.

Proof. The situation reduces through Definition 32 to that of Proposition 25. ■

Theorem 34 If we consider a finite class of maps $\Phi_i$, $i = 0, 1, \dots, k$, based on finite
state machines of bounded memory and if the environment is generated ergodically
by a finite state machine of bounded memory and if the action frequencies for any
internal state of the generating finite state machine converge, then there almost
surely exist limiting state transition parameters $\theta_i$ for all $i$ and there is $N < \infty$ such
that for all $n > N$ the map $\Phi_i$ selected by minimizing ICost at time $n$ generates
parameters whose limit is $\theta_0$, which is the optimal HMM.



Proof. We combine Proposition 33 with Theorem [?]. ■

How to choose the actions so that these results have the desired implications for
reinforcement learning is the subject of ongoing work [Hut09].

8 Conclusions 

Feature Markov Decision Processes were introduced in [Hut09] as a framework for
creating generic reinforcement learning agents that can learn to perform well in
a large variety of complex environments. The framework was introduced as a concept without
theory or empirical studies. First empirical results are reported in [Mah10]. Here we
provide a consistency theory by focusing on the sequence prediction case with and
without side information. We briefly discuss the active case where an agent takes
actions that may affect the environment. The active case and empirical studies are
the subject of ongoing and future work.